{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prepare Dataset for Bias Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Amazon Customer Reviews Dataset\n",
    "\n",
    "https://s3.amazonaws.com/dsoaws/amazon-reviews-pds/readme.html\n",
    "\n",
    "## Schema\n",
    "\n",
    "- `marketplace`: 2-letter country code (in this case all \"US\").\n",
    "- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.\n",
    "- `review_id`: A unique ID for the review.\n",
    "- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.\n",
    "- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.\n",
    "- `product_title`: Title description of the product.\n",
    "- `product_category`: Broad product category that can be used to group reviews (in this case digital videos).\n",
    "- `star_rating`: The review's rating (1 to 5 stars).\n",
    "- `helpful_votes`: Number of helpful votes for the review.\n",
    "- `total_votes`: Number of total votes the review received.\n",
    "- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?\n",
    "- `verified_purchase`: Was the review from a verified purchase?\n",
    "- `review_headline`: The title of the review itself.\n",
    "- `review_body`: The text of the review.\n",
    "- `review_date`: The date the review was written."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker\n",
    "import pandas as pd\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "bucket = sess.default_bucket()\n",
    "role = sagemaker.get_execution_role()\n",
    "region = boto3.Session().region_name"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download Dataset Files\n",
    "\n",
    "Let's start by retrieving a subset of the Amazon Customer Reviews dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 cp 's3://dsoaws/amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz' ./data-clarify/\n",
    "!aws s3 cp 's3://dsoaws/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz' ./data-clarify/\n",
    "!aws s3 cp 's3://dsoaws/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz' ./data-clarify/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "df_giftcards = pd.read_csv(\n",
    "    \"./data-clarify/amazon_reviews_us_Gift_Card_v1_00.tsv.gz\",\n",
    "    delimiter=\"\\t\",\n",
    "    quoting=csv.QUOTE_NONE,\n",
    "    compression=\"gzip\",\n",
    ")\n",
    "df_giftcards.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_giftcards.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "df_software = pd.read_csv(\n",
    "    \"./data-clarify/amazon_reviews_us_Digital_Software_v1_00.tsv.gz\",\n",
    "    delimiter=\"\\t\",\n",
    "    quoting=csv.QUOTE_NONE,\n",
    "    compression=\"gzip\",\n",
    ")\n",
    "df_software.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_software.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "df_videogames = pd.read_csv(\n",
    "    \"./data-clarify/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz\",\n",
    "    delimiter=\"\\t\",\n",
    "    quoting=csv.QUOTE_NONE,\n",
    "    compression=\"gzip\",\n",
    ")\n",
    "df_videogames.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_videogames.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format='retina'\n",
    "\n",
    "df_giftcards[[\"star_rating\", \"review_id\"]].groupby(\"star_rating\").count().plot(\n",
    "    kind=\"bar\", title=\"Breakdown by Star Rating\"\n",
    ")\n",
    "plt.xlabel(\"Star Rating\")\n",
    "plt.ylabel(\"Review Count\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format='retina'\n",
    "\n",
    "df_software[[\"star_rating\", \"review_id\"]].groupby(\"star_rating\").count().plot(\n",
    "    kind=\"bar\", title=\"Breakdown by Star Rating\"\n",
    ")\n",
    "plt.xlabel(\"Star Rating\")\n",
    "plt.ylabel(\"Review Count\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format='retina'\n",
    "\n",
    "df_videogames[[\"star_rating\", \"review_id\"]].groupby(\"star_rating\").count().plot(\n",
    "    kind=\"bar\", title=\"Breakdown by Star Rating\"\n",
    ")\n",
    "plt.xlabel(\"Star Rating\")\n",
    "plt.ylabel(\"Review Count\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Combine Data Frames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_giftcards.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_software.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_videogames.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.concat([df_giftcards, df_software, df_videogames], ignore_index=True, sort=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "\n",
    "sns.countplot(data=df, x=\"star_rating\", hue=\"product_category\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Balance the Dataset by Product Category and Star Rating"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_grouped_by = df.groupby([\"product_category\", \"star_rating\"])[[\"product_category\", \"star_rating\"]]\n",
    "df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))\n",
    "df_balanced.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "\n",
    "sns.countplot(data=df_balanced, x=\"star_rating\", hue=\"product_category\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Write a CSV with Header"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Unbalanced label classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = \"./data-clarify/amazon_reviews_us_giftcards_software_videogames.csv\"\n",
    "df.to_csv(path, index=False, header=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Balanced label classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_balanced.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path_balanced = \"./data-clarify/amazon_reviews_us_giftcards_software_videogames_balanced.csv\"\n",
    "df_balanced.to_csv(path_balanced, index=False, header=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Write as JSONLINES"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path_jsonlines = \"./data-clarify/amazon_reviews_us_giftcards_software_videogames_balanced.jsonl\"\n",
    "df_balanced.to_json(path_or_buf=path_jsonlines, orient=\"records\", lines=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Upload Train Data to S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "timestamp = int(time.time())\n",
    "\n",
    "bias_data_s3_uri = sess.upload_data(bucket=bucket, key_prefix=\"bias-detection-{}\".format(timestamp), path=path)\n",
    "bias_data_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 ls $bias_data_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "balanced_bias_data_s3_uri = sess.upload_data(\n",
    "    bucket=bucket, key_prefix=\"bias-detection-{}\".format(timestamp), path=path_balanced\n",
    ")\n",
    "balanced_bias_data_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 ls $balanced_bias_data_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "balanced_bias_data_jsonlines_s3_uri = sess.upload_data(\n",
    "    bucket=bucket, key_prefix=\"bias-detection-{}\".format(timestamp), path=path_jsonlines\n",
    ")\n",
    "balanced_bias_data_jsonlines_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 ls $balanced_bias_data_jsonlines_s3_uri"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Store Variables for Next Notebook(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store balanced_bias_data_jsonlines_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store balanced_bias_data_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store bias_data_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Release Resources\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%html\n",
    "\n",
    "<p><b>Shutting down your kernel for this notebook to release resources.</b></p>\n",
    "<button class=\"sm-command-button\" data-commandlinker-command=\"kernelmenu:shutdown\" style=\"display:none;\">Shutdown Kernel</button>\n",
    "        \n",
    "<script>\n",
    "try {\n",
    "    els = document.getElementsByClassName(\"sm-command-button\");\n",
    "    els[0].click();\n",
    "}\n",
    "catch(err) {\n",
    "    // NoOp\n",
    "}    \n",
    "</script>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%javascript\n",
    "\n",
    "try {\n",
    "    Jupyter.notebook.save_checkpoint();\n",
    "    Jupyter.notebook.session.delete();\n",
    "}\n",
    "catch(err) {\n",
    "    // NoOp\n",
    "}"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
